Abstract

Background: Knowledge-based potentials have been widely used in the last 20 years for fold recognition, protein 
structure prediction from amino acid sequence, ligand binding, protein design, and many other purposes. However,
these potentials are generally not readily accessible online.

Results: Our new knowledge-based potential server makes available many of these potentials for easy use to 
automatically compute the energies of protein structures or models supplied. Our web server for protein energy 
estimation uses four-body potentials, short-range potentials, and 23 different two-body potentials. Users can select 
potentials according to their needs and preferences. Files containing the coordinates of protein atoms in the PDB 
format can be uploaded as input. The results will be returned to the user's email address. 

Conclusions: Our Potentials 'R'Us server is an easily accessible, freely available tool with a web interface that 
collects all existing and future protein coarse-grained potentials and computes energies of multiple structural 
models. 



Background 

Prediction of protein three-dimensional structures from 
their amino acid sequences is one of the important 
goals of computational biology. The rate of determina- 
tion of protein structures by experimental methods such 
as nuclear magnetic resonance (NMR) spectroscopy and 
X-ray crystallography cannot, unfortunately, keep up
with the extremely rapid growth in the number of protein
sequences from large-scale genome sequencing studies.
Additionally, experimental structure determination methods
are quite expensive both in terms of equipment and 
human effort, mostly because of difficulties in obtaining 
high quality protein crystals [1]. Because of this, the 
computational prediction of protein structure from 
amino acid sequence becomes increasingly important. 

There are two types of computational strategies for
predicting protein structure [2]: template-based protein
structure modeling and ab initio structure prediction. Ab initio
methods try to build three-dimensional protein models 
"from scratch", and are based on physical considerations 
rather than on the use of a previously solved individual
structure. Ab initio procedures require significant compu- 
tational resources to perform searches throughout the 
whole conformational space to seek the lowest energy con- 
formers, and therefore are applicable only for relatively 
small proteins. Template-based protein modeling utilizes 
known protein structures as the starting points for struc- 
ture prediction. These methods may also be divided into 
two categories: comparative (or homology) modeling and 
protein threading (fold recognition). Homology modeling 
is based on the assumption that two homologous proteins 
have similar structures. When the query sequence has 
sequence identity of approximately 30% or higher in com- 
parison with a sequence having a known structure available 
from the Protein Data Bank (PDB), we use homology mod- 
eling to predict protein structure. When only distant 
homologs with low sequence identity to the query 
sequence can be found in the PDB, we use a protein threading
algorithm to select a protein fold. The basic idea of protein
threading is that the target sequence for which the struc- 
ture is being predicted is threaded through the backbones 
of a collection of template protein structures and energy 
scores are calculated for each sequence-structure 
alignment using knowledge-based coarse-grained potentials
such as four-body potentials [3], two-body potentials [4,5] 
and short-range potentials [6] employed in this server. 

All two-body potentials available on our server were 
previously analyzed in 2005 by Pokarowski et al. [7], 
who compared 29 different two-body potentials cur- 
rently used in computational biology, and approximated 
them with simple functions of the physical properties of 
amino acids. All these pair-contact potentials can be 
expressed as symmetric matrices. The best known
representatives of the two-body potentials are the MJ
potentials, which were first introduced by Miyazawa and
Jernigan in 1985 [4], and then rederived using an 
updated, larger protein dataset in 1996 [5]. Both papers 
are highly cited, according to the ISI Web of Science. 
MJ potentials were derived from the statistics of inter- 
residue contacts occurring in a set of proteins using the 
quasi-chemical approximation with an approximate esti- 
mation of the chain connectivity effects [4,5]. There are 
two kinds of MJ two-body contact potentials, $e_{ij}$ and $e'_{ij}$,
denoted as MJ1-3 and MJ1h-3h in our web server. The
latter (marked with the suffix "h") includes the
energy of transfer of amino acids from water to the
protein environment. The pair-contact energies $e'_{ij}$ and $e_{ij}$
(in $k_B T$ units, where $k_B$ is the Boltzmann constant and $T$
is temperature) were derived based on the assumptions that

$$\exp(-e'_{ij}) = \frac{\bar{n}_{ij}\,\bar{n}_{00}}{\bar{n}_{i0}\,\bar{n}_{j0}} \quad \text{and} \quad \exp(-2e_{ij}) = \frac{\bar{n}_{ij}^{2}}{\bar{n}_{ii}\,\bar{n}_{jj}}.$$

Here indices $i$, $j$, and 0 represent residue $i$, residue $j$ and solvent
respectively, and $\bar{n}_{ij}$ is the statistical average of the
number of contacts $n_{ij}$ between residues of types $i$ and $j$.
More details of the derivation of MJ two-body contact 
potentials are given in reference [4]. Since the correlation
coefficients between MJ2h and MJ1h and between MJ3
and MJ2, as shown in reference [7], are quite high, we
only use MJ2h and MJ3 for threading purposes in our 
potential server. More details about all of the two-body 
potentials and their abbreviations used in our server can 
be found in Pokarowski's paper [7]. The four-body con- 
tact potentials and short-range interaction potentials 
have been derived by considering different aspects of pro- 
tein structures than those used to derive pair-contact 
potentials. The four-body contact potentials [3] are 
appropriate for representing the cooperative parts of the 
protein folding process, and we have shown that they are 
quite successful for recognizing the native structures 
among hundreds or even thousands of decoys from the 
Decoys'R'Us database [8]. Short-range interaction ener- 
gies allow us to estimate free energies from the statistical 
distribution of local conformational descriptors [6]. We 
usually assume that the lower the computed energy
score, the better the predicted structure, in accordance
with the thermodynamic hypothesis that the native state
of a protein has the lowest free energy.

Potential energies are essential for all protein structure
prediction methods, and can be used either to guide the
conformational search process, or to select a native struc- 
ture from a preselected set of possible models of the struc- 
ture. Protein contact potentials are also used in protein 
design, protein docking, simulations of folding, and in 
many other applications. Knowledge-based optimization 
potentials are usually derived from the known protein 
structures solved by X-ray crystallography or NMR, by fit- 
ting their values to optimize the recognition of the native 
structures from sets of computer generated structures 
(decoys) [9]. In contrast to atomic potentials based on real 
physical interactions, knowledge-based potentials incorpo- 
rate and average over many different physical interactions, 
such as hydrophobic, electrostatic, hydrogen-bond and 
cation-π interactions, and so these statistically derived
potentials do not necessarily reflect true energies but 
rather are effective ones averaged over many of the atomic 
details. Results of the Critical Assessment of Techniques 
for Protein Structure Prediction (CASP) show that the 
groups using knowledge-based statistical potentials have 
been more successful for both ab initio structure predic- 
tion and template based modeling [10-15]. 
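
As a minimal sketch of how a pair-contact potential of the kind described above is typically applied, the following Python fragment sums symmetric pair energies over all residue pairs in contact. The 6.5 Å cutoff, the exclusion of sequence neighbors, the function names, and the toy energy values are illustrative assumptions; they are not the server's parameters or code, and the numbers are not taken from any published potential.

```python
import math

# Hypothetical symmetric contact matrix stored as a dictionary.
# The values are made up for illustration only.
TOY_POTENTIAL = {
    ("ALA", "ALA"): -0.1,
    ("ALA", "LEU"): -0.4,
    ("LEU", "LEU"): -0.8,
}


def pair_energy(potential, a, b):
    """Look up a symmetric pair energy regardless of residue order."""
    if (a, b) in potential:
        return potential[(a, b)]
    return potential[(b, a)]


def contact_energy(residues, potential, cutoff=6.5, min_separation=2):
    """Sum pair energies over residue pairs whose points lie within cutoff.

    residues: list of (residue_name, (x, y, z)) in chain order.
    cutoff (in angstroms) and min_separation are assumed parameters.
    """
    total = 0.0
    for i in range(len(residues)):
        for j in range(i + min_separation, len(residues)):
            name_i, pos_i = residues[i]
            name_j, pos_j = residues[j]
            if math.dist(pos_i, pos_j) <= cutoff:
                total += pair_energy(potential, name_i, name_j)
    return total


toy_chain = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("LEU", (3.0, 0.0, 0.0)),
    ("LEU", (5.0, 0.0, 0.0)),
    ("ALA", (20.0, 0.0, 0.0)),
]
toy_energy = contact_energy(toy_chain, TOY_POTENTIAL)
```

In this toy chain only the first and third residues form a counted contact, so the total is the single ALA-LEU entry. Evaluating the same function for a native structure and for each decoy gives the scores that are then compared.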

Although most of the potentials available on our ser- 
ver were derived a number of years ago, these potentials 
have never been collected nor made accessible to the 
public through a web server. Our knowledge-based 
potential server will overcome this deficiency and should 
be an extremely convenient location for any researcher 
to compute and compare energies of different protein 
conformations of the same protein. 

Implementation 

Figure 1 shows schematically how our potential server 
works. Our potential server computes energies for the 
supplied set of protein conformations. However, it will
not generate conformations for the submitted amino 
acid sequence. 

All 27 different knowledge-based potentials included 
in our server are listed at the top of the web page at
http://gor.bb.iastate.edu/potential. The abbreviations of 
the potentials employed in our server are the same as in 
reference [7]. Details of each of the knowledge-based 
potentials and the related literature references can be 
accessed by clicking the corresponding name on the list 
of 27 potentials. All of these 27 different potentials 
except the general four-body potentials have been pub- 
lished in the literature over the last 25 years. The gen- 
eral four-body potentials (the second entry on the list of 
the potentials) are the newest (not yet published) modi- 
fications of our recently published four-body contact 
potentials [3] that are also included in the server (see 

Figure 1 The flowchart for the potential server. 



the first entry on the list). We will update the literature 
referring to the general four-body potentials after the 
publication of the relevant paper. We should also note 
that our potential server is an ongoing project; the list 
of potentials included is neither final nor complete, and 
we will add more knowledge-based potentials (especially 
those uncorrelated with presently included ones) that 
have either been omitted now or that will be published 
in the future. We are open to comments and sugges- 
tions for including new knowledge-based potentials and 
for improvement of our web server. 

Results 

The Pearson correlation between four-body potentials 
and general four-body potentials is 0.62 (Figure 2). Since 
the short-range potentials and four-body potentials were
derived in a completely different way from the
pairwise potentials, we cannot calculate the Pearson
correlation between them and the pairwise potentials.
Because highly correlated pairwise potentials may lead
to redundant threading results, we note that some pairs
of pairwise potentials have low correlations, below 0.3
(see Figure 2). The lowest correlation, -0.24, is between
VD and MSBM. Users may choose the least correlated
potentials according to Figure 2.
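
The procedure used to obtain the correlations in Figure 2 is not detailed here. One plausible approach, assumed in the sketch below, is to compute the Pearson correlation over the unique (upper-triangle) entries of two symmetric pair-potential matrices. The function names are hypothetical and the small matrices are made-up examples.

```python
import math


def upper_triangle(matrix):
    """Return the entries with i <= j of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def potential_correlation(a, b):
    """Correlate two symmetric matrices using their unique entries."""
    return pearson(upper_triangle(a), upper_triangle(b))


# Made-up 2 x 2 matrices: "scaled" is a linear function of "base",
# "flipped" is its negative.
base = [[1.0, 2.0], [2.0, 4.0]]
scaled = [[5.0, 7.0], [7.0, 11.0]]
flipped = [[-1.0, -2.0], [-2.0, -4.0]]
```

For real 20 x 20 matrices the upper triangle holds 210 entries. Potentials whose correlation is close to 1 carry largely redundant information, so a user would pick pairs with low values.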

The potential server currently accepts either a single file or a
set of files corresponding to different conformations of the 
same protein in the PDB format. The users should com- 
press multiple PDB files into a single zip archive before 
submitting them to the server. Each PDB file should con- 
tain only a single conformation. The server will check 
whether the uploaded PDB files correspond to NMR or X- 
ray structures and will check the total number of atoms in 
a single PDB file. The server will not compute energies
for NMR structures or for PDB files having more
than 25000 atoms; in these cases it will only send a warning
message via email. If the uploaded zip file contains both NMR and
X-ray structures, the server recognizes the X-ray structures
and computes energies only for the qualified structures.
If users want to estimate the energies of NMR structures, 
they should split the multiple models into separate PDB 
files before submitting them to the server. We allow the 
main program to run in the background, so users can 
close their web browsers once they finish uploading the 
files. The server first calculates the residue accessible surface
area using NACCESS [16], which is used for energy
Figure 2 The correlations among various potentials. The circle nodes represent all pair-wise potentials, squares four-body potentials, and the
diamond the short-range potentials used in this server. The edges represent the Pearson correlations between pairs of potentials. FB denotes 
four-body potentials, GFB general four-body potentials, and "Short" short-range conformational potentials. (See reference [7] for identities.) 



calculations with four-body contact potentials, and then it 
computes and outputs the energies for the four-body 
potentials, the two-body potentials and the short-range 
conformational potentials. Finally the server integrates all 
of these results into one text output file and sends this file 
to the email address supplied by the user. 
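
The input rules above can be checked locally before uploading. The following sketch splits a multi-model PDB file into single-model blocks and applies the 25000-atom limit quoted in the text. How the server itself distinguishes NMR from X-ray files is not described, so treating "more than one MODEL record" as the rejection criterion, counting both ATOM and HETATM records, and all function names are assumptions.

```python
MAX_ATOMS = 25000  # limit quoted in the text


def split_models(pdb_text):
    """Return one text block per MODEL/ENDMDL pair.

    A file without MODEL records is returned unchanged as a single block.
    """
    models, current = [], []
    in_model = saw_model = False
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            in_model, saw_model, current = True, True, []
        elif record == "ENDMDL":
            models.append("\n".join(current) + "\n")
            in_model = False
        elif in_model and record in ("ATOM", "HETATM", "TER"):
            current.append(line)
    return models if saw_model else [pdb_text]


def count_atoms(pdb_text):
    """Count coordinate records (ATOM and HETATM) in a PDB text."""
    return sum(
        1 for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    )


def is_acceptable(pdb_text):
    """Assumed rule: one conformation and at most MAX_ATOMS atoms."""
    return (len(split_models(pdb_text)) == 1
            and count_atoms(pdb_text) <= MAX_ATOMS)
```

Each block returned by `split_models` can be written to its own PDB file and the files compressed into one zip archive for submission.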

All the structural data files in the PDB format are ana- 
lyzed on a coarse grained level as follows: the geometric 
center of all side-chain heavy atoms from one residue is 
calculated to represent this residue, or if the PDB file 
has only backbone atoms, then the Cα atom is used to
represent the residue. 
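
The coarse-graining rule described above can be sketched as follows. This is an illustrative reading of the text, not the server's code: the function name is hypothetical, the fixed-column parsing follows the standard PDB ATOM record layout, alternate locations are not handled, and falling back to the Cα atom for any individual residue without side-chain heavy atoms (for example glycine) is an assumption.

```python
BACKBONE = {"N", "CA", "C", "O", "OXT"}


def coarse_grain(pdb_text):
    """Return a list of (residue_name, (x, y, z)), one point per residue.

    The point is the geometric center of the side-chain heavy atoms,
    or the CA position when no side-chain heavy atoms are present.
    """
    residues = {}  # keyed by (chain, residue number + insertion code)
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        element = line[76:78].strip()
        # Skip hydrogens: use the element column when present,
        # otherwise fall back to the atom name.
        if element in ("H", "D"):
            continue
        if not element and atom_name.lstrip("0123456789").startswith("H"):
            continue
        key = (line[21], line[22:27])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        entry = residues.setdefault(
            key, {"name": line[17:20].strip(), "side": [], "ca": None}
        )
        if atom_name == "CA":
            entry["ca"] = xyz
        elif atom_name not in BACKBONE:
            entry["side"].append(xyz)

    points = []
    for entry in residues.values():
        if entry["side"]:
            n = len(entry["side"])
            center = tuple(sum(p[k] for p in entry["side"]) / n
                           for k in range(3))
        elif entry["ca"] is not None:
            center = entry["ca"]
        else:
            continue  # residue with neither side chain nor CA
        points.append((entry["name"], center))
    return points
```

The resulting list of residue names and points is the kind of input that a contact-based energy function operates on.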

We have provided one example for users on the website 
to help them learn how to use the server. This example 
shows the practical application of our potential server for 
fold recognition. The supplied zip archive contains 25
PDB files including 1ctf.pdb, the native crystal structure
of the C-terminal domain of the ribosomal protein L7/L12
from Escherichia coli at a resolution of 1.7 Å [17], and 24
other PDB files that are computer-generated conformations,
or so-called decoys, of 1ctf. The results returned by
the server via e-mail show that the native structure has a
lower energy than any of the decoys for all
potentials except TS, MJ1, and MSBM. The ability to
compute the energies of threadings by using a variety of 
knowledge-based potentials increases the reliability of fold 
recognition and may be used in the future to develop 
improved consensus predictions. 
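
To illustrate how the returned energies might be used for fold recognition, the sketch below ranks a native structure among decoys for each potential and forms a simple consensus by averaging ranks. The averaging scheme, the function names, the potential labels "P1" to "P3", and all energy values are hypothetical; they are not results from the server and not a method proposed in this paper.

```python
def native_rank(energies, native):
    """Return the 1-based rank of `native` when sorted by increasing energy."""
    ordered = sorted(energies, key=energies.get)
    return ordered.index(native) + 1


def consensus_ranking(energies_by_potential):
    """Average each structure's rank over all potentials.

    Returns (mean_rank, structure_name) tuples, best first.
    """
    totals = {}
    for energies in energies_by_potential.values():
        ordered = sorted(energies, key=energies.get)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    count = len(energies_by_potential)
    return sorted((total / count, name) for name, total in totals.items())


# Made-up energies for one native structure and two decoys.
toy_energies = {
    "P1": {"native": -10.0, "decoy1": -8.0, "decoy2": -5.0},
    "P2": {"native": -3.0, "decoy1": -4.0, "decoy2": -1.0},
    "P3": {"native": -7.0, "decoy1": -2.0, "decoy2": -6.0},
}
toy_consensus = consensus_ranking(toy_energies)
```

In this toy data one potential ranks a decoy above the native structure, yet the native structure still has the best mean rank, which is the motivation for combining several weakly correlated potentials.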

For the above example containing 25 PDB files, it 
takes the server about 10 seconds to complete the



calculations. We have also tested the potential server on 
a much larger set of 1783 PDB files for a protein com- 
posed of 101 amino acids; in this case it took the server 
around 9 minutes to compute the results and return 
them by e-mail. Recently, we have tested the server 
using 2278 PDB files submitted simultaneously, and it
took the server around 15 minutes to return the results
by email. The size of the zip file for those 2278 files was
47.1 MB. This shows that the server is able to
process a large number of PDB files at one time. It
should be convenient for users requiring energies to be 
calculated for large numbers of computer generated 
conformations. 

The server consists of a Linux box with RedHat Enter- 
prise 3.0 operating system with 4.5 GB RAM and 140 
GB hard disk storage. The program code was written in 
C++ and the web interface has been developed using a 
CGI script written in HTML and Perl. We may make
further improvements in the future to our server by 
upgrading its hardware and software for enhanced per- 
formance depending on the extent of users' demands. 
Users are encouraged to contact the system administra- 
tor via the e-mail provided on the web page to solve 
any possible problems or to suggest improvements to 
the functionality and performance of the server. 

Conclusion 

The knowledge-based potential server is an easily acces- 
sible, freely available tool with a web interface that col- 
lects all existing and future protein coarse-grained 
potentials and computes energies of multiple structural
models. It allows evaluation of energies of different pro- 
tein folds for non-computational biology specialists, and 
significantly improves the access to a wide variety of 
knowledge-based potentials. The server accepts multiple 
structural files in the PDB format (including hundreds 
or even thousands of decoys) and the results are sent 
back to users promptly at the supplied e-mail address. 

Availability and requirements 

Project home page: http://gor.bb.iastate.edu/potential 
Operating system: RedHat Enterprise 3.0 operating system with 4.5 GB RAM and 140 GB hard disk storage
Programming language: C++, Perl, CGI script 
License: GNU GPL 

The potential server is freely accessible to all users. 

Abbreviations 

PDB: Protein Data Bank; NMR: nuclear magnetic resonance; CASP: Critical 
Assessment of Techniques for Protein Structure Prediction; For all 
abbreviations for two-body potentials: see reference [7] 